AITopics

Country:

Europe > Ukraine > Kyiv Oblast > Kyiv (0.06)
North America > United States > Texas (0.05)

Technology: Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.42)

Neural Information Processing SystemsFeb-18-2026, 11:49:07 GMT

MEQA: A Benchmark for Multi-hop Event-centric Question Answering with Explanations

However, there's a lack of insight into how these models perform in terms

large language model, machine learning, question answering, (21 more...)

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
Europe > Ukraine > Kyiv Oblast > Kyiv (0.05)
Asia > Middle East > Syria (0.04)
(18 more...)

Genre: Research Report > New Finding (0.46)

Industry:

Government (1.00)
Law (0.93)
Leisure & Entertainment > Sports > Basketball (0.67)
Law Enforcement & Public Safety (0.67)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Question Answering (0.84)
(2 more...)

Neural Information Processing SystemsFeb-16-2026, 23:51:31 GMT

WhodunitBench: Evaluating Large Multimodal Agents via Murder Mystery Games

Recently, large language models (LLMs) have achieved superior performance, empowering the development of large multimodal agents (LMAs).

artificial intelligence, large language model, natural language, (17 more...)

Country:

Asia > China > Guangdong Province > Shenzhen (0.04)
Europe > Iceland > Capital Region > Reykjavik (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Industry:

Law (1.00)
Information Technology (1.00)
Leisure & Entertainment > Games > Computer Games (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.93)

Neural Information Processing SystemsFeb-14-2026, 07:56:54 GMT

72393bd47a35f5b3bee4c609e7bba733-Supplemental-Conference.pdf

large language model, machine learning, reasoning process, (18 more...)

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Maryland > Baltimore (0.04)
(2 more...)

Genre: Workflow (0.94)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

Neural Information Processing SystemsFeb-14-2026, 07:56:50 GMT

Deductive Verification of Chain-of-Thought Reasoning

To facilitate this procedure, we propose Natural Program, a natural language-based deductive reasoning format. Our approach enables models to generate precise reasoning steps where subsequent steps are more rigorously grounded on prior steps.

large language model, machine learning, reasoning process, (18 more...)

Country:

North America > United States > California > San Diego County > San Diego (0.04)
Asia > Middle East > Jordan (0.04)
North America > United States > Maryland > Baltimore (0.04)
(2 more...)

Genre: Workflow (0.94)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Cognitive Science > Problem Solving (1.00)
(2 more...)

arXiv.org Artificial IntelligenceDec-11-2025

Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics

Song, Maojia, Liu, Renhang, Wang, Xinyu, Jiang, Yong, Xie, Pengjun, Huang, Fei, Zhou, Jingren, Herremans, Dorien, Poria, Soujanya

RAG (Retrieval-Augmented Generation) systems and web agents are increasingly evaluated on multi-hop deep search tasks, yet current practice suffers from two major limitations. First, most benchmarks leak the reasoning path in the question text, allowing models to follow surface cues rather than discover reasoning chains autonomously. Second, evaluation is typically reduced to a single pass rate, which collapses diverse behaviours into one score and obscures whether failures stem from inadequate search, poor knowledge use, or inappropriate refusal. To address these issues, we present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox that ensures full traceability of model actions, and a holistic evaluation framework that separates search sufficiency, knowledge utilisation, and refusal behaviour. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures: models struggle with knowledge utilisation despite having sufficient evidence and demonstrate near-absent appropriate refusal when evidence is lacking. These patterns expose a fundamental gap: today's systems excel at executing given reasoning paths but fail when required to discover them. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies, incorporating verification loops and systematic evidence tracking that improve both search and synthesis capabilities. This baseline demonstrates that WebDetective's diagnostic framework can guide concrete architectural improvements, establishing our benchmark as a critical tool for developing genuinely autonomous reasoning systems rather than pattern-following agents.

large language model, machine learning, natural language, (22 more...)

2510.05137

Genre: Research Report > New Finding (0.46)

Industry: Leisure & Entertainment (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
(2 more...)

arXiv.org Artificial IntelligenceDec-10-2025

MM-CoT:A Benchmark for Probing Visual Chain-of-Thought Reasoning in Multimodal Models

Zhang, Jusheng, Cai, Kaitong, Guo, Xiaoyang, Liu, Sidi, Lv, Qinhan, Chen, Ruiqi, Yang, Jing, Fan, Yijia, Sun, Xiaofei, Wang, Jian, Chen, Ziliang, Lin, Liang, Wang, Keze

The ability to perform Chain-of-Thought (CoT) reasoning marks a major milestone for multimodal models (MMs), enabling them to solve complex visual reasoning problems. Y et a critical question remains: is such reasoning genuinely grounded in visual evidence and logically coherent? Existing benchmarks emphasize generation but neglect verification, i.e., the capacity to assess whether a reasoning chain is both visually consistent and logically valid. T o fill this gap, we introduce MM-CoT, a diagnostic benchmark specifically designed to probe the visual grounding and logical coherence of CoT reasoning in MMs. Instead of generating free-form explanations, models must select the sole event chain that satisfies two orthogonal constraints: (i) visual consistency, ensuring all steps are anchored in observable evidence, and (ii) logical coherence, ensuring causal and commonsense validity. Adversarial distractors are engineered to violate one of these constraints, exposing distinct reasoning failures. W e evaluate leading vision-language models on MM-CoT and find that even the most advanced systems struggle, i.e., revealing a sharp discrepancy between generative fluency and true reasoning fidelity. MM-CoT shows low correlation with existing benchmarks, confirming that it measures a unique combination of visual grounding and logical reasoning. This benchmark provides a foundation for developing future models that reason not just plausibly, but faithfully and coherently within the visual world.

large language model, machine learning, natural language, (18 more...)

2512.08228

Country:

Europe > Austria (0.28)
Asia > Middle East > UAE (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
(2 more...)

Leach, Caroline N., Klusty, Mitchell A., Armstrong, Samuel E., Pickarski, Justine C., Hankins, Kristen L., Collier, Emily B., Shah, Maya, Mullen, Aaron D., Bumgardner, V. K. Cody

Toward an AI Reasoning-Enabled System for Patient-Clinical Trial Matching

arXiv.org Artificial IntelligenceDec-10-2025

Screening patients for clinical trial eligibility remains a manual, time - consuming, and resource-intensive process. W e present a secure, scalable proof-of - concept system for Artificial Intelligence ( AI)- augmented patient - trial matching that addresses key implementation challenges: integrating heterogeneous electronic health record (EHR) data, facilitating expert review, and maintaining rigorous security standards. Leveraging open-source, reasoning-enabled large language models (LLMs), the system moves beyond binary classification to generate structured eligibility assessments with interpretable reasoning chains that support human-in - the - loop review. This decision support tool represents eligibility as a dynamic state rather than a fixed determination, identifying matches whe n available and offering actionable recommendations that could render a patient eligible in the future . The system aims to reduce coordinator burden, intelligently broaden the set of trials considered for each patient and guarantee comprehensive auditability of all AI - generated outputs. Introduction Applications of artificial intelligence (AI) in healthcare are increasingly focused on improving administrative efficiency and optimizing clinical workflows . Identifying relevant trials and screening them for a particular patient is traditionally manual, time - consuming, and heavily reliant on clinical expertise.

information, large language model, machine learning, (18 more...)

2512.08026

Country: North America > United States > Kentucky (0.28)

Genre: Research Report > Experimental Study (1.00)

Industry:

Health & Medicine > Therapeutic Area > Oncology (1.00)
Health & Medicine > Pharmaceuticals & Biotechnology (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.70)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

arXiv.org Artificial IntelligenceDec-9-2025

VulnLLM-R: Specialized Reasoning LLM with Agent Scaffold for Vulnerability Detection

Nie, Yuzhou, Li, Hongwei, Guo, Chengquan, Jiang, Ruizhe, Wang, Zhun, Li, Bo, Song, Dawn, Guo, Wenbo

We propose VulnLLM-R, the~\emph{first specialized reasoning LLM} for vulnerability detection. Our key insight is that LLMs can reason about program states and analyze the potential vulnerabilities, rather than simple pattern matching. This can improve the model's generalizability and prevent learning shortcuts. However, SOTA reasoning LLMs are typically ultra-large, closed-source, or have limited performance in vulnerability detection. To address this, we propose a novel training recipe with specialized data selection, reasoning data generation, reasoning data filtering and correction, and testing-phase optimization. Using our proposed methodology, we train a reasoning model with seven billion parameters. Through extensive experiments on SOTA datasets across Python, C/C++, and Java, we show that VulnLLM-R has superior effectiveness and efficiency than SOTA static analysis tools and both open-source and commercial large reasoning models. We further conduct a detailed ablation study to validate the key designs in our training recipe. Finally, we construct an agent scaffold around our model and show that it outperforms CodeQL and AFL++ in real-world projects. Our agent further discovers a set of zero-day vulnerabilities in actively maintained repositories. This work represents a pioneering effort to enable real-world, project-level vulnerability detection using AI agents powered by specialized reasoning models. The code is available at~\href{https://github.com/ucsb-mlsec/VulnLLM-R}{github}.

large language model, machine learning, natural language, (15 more...)

2512.07533

Country:

North America > United States > California (0.67)
North America > United States > Illinois (0.46)

Genre: Research Report > New Finding (0.67)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Truong, Nghi, Puranam, Phanish, Koçak, Özgecan

Why They Disagree: Decoding Differences in Opinions about AI Risk on the Lex Fridman Podcast

arXiv.org Artificial IntelligenceDec-9-2025

The emergence of transformative technologies often surfaces deep societal divisions, nowhere more evident than in contemporary debates about artificial intelligence (AI). A striking feature of these divisions is that they persist despite shared interests in ensuring that AI benefits humanity and avoiding catastrophic outcomes. This paper analyzes contemporary debates about AI risk, parsing the differences between the "doomer" and "boomer" perspectives into definitional, factual, causal, and moral premises to identify key points of contention. We find that differences in perspectives about existential risk ("X-risk") arise fundamentally from differences in causal premises about design vs. emergence in complex systems, while differences in perspectives about employment risks ("E-risks") pertain to different causal premises about the applicability of past theories (evolution) vs their inapplicability (revolution). Disagreements about these two forms of AI risk appear to share two properties: neither involves significant disagreements on moral values and both can be described in terms of differing views on the extent of boundedness of human rationality. Our approach to analyzing reasoning chains at scale, using an ensemble of LLMs to parse textual data, can be applied to identify key points of contention in debates about risk to the public in any arena.

artificial intelligence, large language model, natural language, (19 more...)